Alejandro Schuler and David Connell
2022
Adapted from Steve Bagley and based on R for Data Science by Hadley Wickham
By the end of the course you should be able to…

If you haven't already, please open RStudio on DataHub by clicking this link. If you're viewing this on bCourses, you'll have to right click and then choose “Open Link in New Tab”.
You will get more out of this tutorial if you try out these things in R yourself!!
The R console window is the left (or lower-left) window in RStudio. The R console uses a “read, eval, print” loop. This is sometimes called a REPL.
> 1 + 2
[1] 3
3 is the answer.
[1] means: the answer is a vector (a list of elements of the same type) and this line starts with the first element of that vector.
> 1 +2
> 1+ 2
> 1+2
> 1 + 2
These all do the same thing. The result of each line is 3:
[1] 3
> 1 + 2 * 3 # R respects order of operations
[1] 7
> 3/4
[1] 0.75
> 6^3
[1] 216
> log(10) # natural log
[1] 2.302585
> log10(10) # log base 10
[1] 1
> sqrt(16)
[1] 4
> c(2.1, -4, 22)
[1] 2.1 -4.0 22.0
The c( ) function, which is short for “combine”, creates a vector from the values you give it.
> 1:50
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
: is a handy shortcut to create a vector that is a sequence of integers from the first number to the second number (inclusive).
Note the [ ] notation. The second output line starts with [26] because its first value, 26, is the 26th element of the vector.
An operation is elementwise (or element-wise) if it is applied to each element of a vector separately, so the result has the same length as the original.
The code below multiplies each element of 1:10 by the corresponding
element of 1:10, that is, it squares each element.
> (1:10)*(1:10)
[1] 1 4 9 16 25 36 49 64 81 100
> (1:10)^2
[1] 1 4 9 16 25 36 49 64 81 100
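As a further sketch of elementwise arithmetic, the example below combines two different vectors of the same length. The variable names and values are made up for illustration.

```r
# Hypothetical measurements for three people (illustrative values only)
height_m  <- c(1.50, 1.60, 2.00)
weight_kg <- c(45, 64, 100)

# Each operation is applied element by element:
# the i-th weight is divided by the square of the i-th height
bmi <- weight_kg / height_m^2
bmi   # 20 25 25
```

The result has one value per person, just like the two input vectors.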
: has a higher precedence than addition +.
> 1 + 0:10
[1] 1 2 3 4 5 6 7 8 9 10 11
> 0:10 + 1 # which operator gets executed first?
[1] 1 2 3 4 5 6 7 8 9 10 11
> (0:10) + 1
[1] 1 2 3 4 5 6 7 8 9 10 11
> 0:(10 + 1)
[1] 0 1 2 3 4 5 6 7 8 9 10 11
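The same precedence rule applies to subtraction, which is a common source of off-by-one mistakes. A minimal sketch, using a made-up variable n:

```r
n <- 5

# : binds more tightly than -, so this is (1:n) - 1
a <- 1:n - 1
a   # 0 1 2 3 4

# Parentheses force the subtraction to happen first
b <- 1:(n - 1)
b   # 1 2 3 4
```

When in doubt, add parentheses to make the intended order explicit.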
> x <- 10
> x
[1] 10
> x / 5
[1] 2
/ is the division operator.
In R, there are (unfortunately) two assignment operators. They have subtly different meanings (more details later).
<- requires that you type two characters. Don't put a space between < and -. (What would happen?)
In RStudio, use “Option -” (Mac) or “Alt -” (PC) to type <- using one key combination.
= is easier to type.
> x <- 10
> x
[1] 10
> x = 20
> x
[1] 20
Some people prefer <- to reduce confusion with the comparison operator == (more on that later).
> x <- 10
> x
[1] 10
> x <- x + 1
> x
[1] 11
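Returning to the earlier question about putting a space between < and -: the sketch below shows what R does in that case. The variable name is_less is made up for illustration.

```r
x <- 10          # assignment: x now holds 10

# With a space, R reads "<" and "-" as two separate operators:
# this compares x with -5 and does NOT assign anything to x
is_less <- x < - 5

is_less          # FALSE, because 10 is not less than -5
x                # still 10
```

No error is raised, which is why this mistake can be hard to spot.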
Choose descriptive names rather than using x and y everywhere.
Avoid names that are too long (e.g. Main.database.first.object.header.length).
See ?make.names for the complete rules on what can be a name.
> a <- 1
> A # this causes an error because A does not have a value
Error: object 'A' not found
There are different conventions for constructing compound names. Warning: disputes over the right way to do this can get heated.
stringlength
string.length
StringLength (CamelCase)
stringLength
string_length (underscore or underbar a.k.a. snake_case)
string-length (hyphen a.k.a. kebab-case)
> for <- 7 # this causes an error
for is a reserved word in R. (It is used in loop control.)
See ?Reserved for the complete rules.
> my_age_end_of_year = 31
> this_year = 2022
> my_birth_year = this_year - my_age_end_of_year
> my_birth_year
[1] 1991
Source: OOMPH course PHW251 - R for Public Health
> sqrt(2)
[1] 1.414214
> sqrt(0:10)
[1] 0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
[9] 2.828427 3.000000 3.162278
> x <- 4
> sqrt(x)
[1] 2
> x
[1] 4
> y <- sqrt(x)
> y
[1] 2
> x <- 10
> y
[1] 2
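What is the value of y after changing the value of x?
Calling a function does not change its input (x remains the same after sqrt(x)).
Once a variable has been assigned a value (y), it keeps its value until updated, even if you change other variables (x) that went into the original assignment of that variable.
> sum
Type sum, then hit the TAB key (or just wait a second).
RStudio shows a list of names that start with sum.
Hit RETURN or ENTER to select the current item.
Type ?name for help on name. Example: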
> ?log
This shows documentation for the log function (and related functions) in the Help pane, including the name and meaning of the arguments and returned values.
> ?"+"
This shows documentation for the + operator.
> weights <- c(1.1, 2.2, 3.3)
> # this divides the weights, element-wise, by the conversion factor:
> weights / 2.2
[1] 0.5 1.0 1.5
> shoesize <- c(9, 12, 6, 10, 10, 16, 8, 4)
> shoesize
[1] 9 12 6 10 10 16 8 4
> sum(shoesize)
[1] 75
> sum(shoesize)/length(shoesize)
[1] 9.375
> mean(shoesize)
[1] 9.375
> x <- c(7, 3, 1, 9)
Subtract the mean of x from x, and then sum the result.
> mean(x)
[1] 5
> x - mean(x)
[1] 2 -2 -4 4
> sum(x - mean(x)) # answer in one expression
[1] 0
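The result above is exactly 0 because the values are whole numbers. With decimal values, floating-point rounding can leave a tiny nonzero remainder, so a sketch like the following (with made-up values) compares against a small tolerance rather than testing for exact equality.

```r
x <- c(0.1, 0.2, 0.3, 0.7)   # illustrative values only
total <- sum(x - mean(x))

# total may be exactly 0 or a tiny number very close to 0, depending on rounding,
# so check that it is close to 0 instead of using ==
close_to_zero <- abs(total) < 1e-12
close_to_zero   # TRUE
```

The tolerance 1e-12 is an arbitrary choice for this example; any small threshold well above rounding error would do.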
> m <- 13
> se <- 0.25
Given the values of m (mean) and se (standard error), construct a vector containing the two values, \( m \pm 2 \times se \).
[1] 12.5 13.5
> ## one way:
> c(m - 2*se, m + 2*se)
[1] 12.5 13.5
> ## another way:
> m + c(-2, 2)*se
[1] 12.5 13.5
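In practice m and se are usually computed from data rather than typed in. The sketch below reuses the shoesize values from earlier and assumes one common definition of the standard error of the mean, the standard deviation divided by the square root of the sample size; the name interval is made up for illustration.

```r
shoesize <- c(9, 12, 6, 10, 10, 16, 8, 4)

m  <- mean(shoesize)                         # sample mean
se <- sd(shoesize) / sqrt(length(shoesize))  # assumed definition of standard error

# Same elementwise trick as above: c(-2, 2) * se gives both offsets at once
interval <- m + c(-2, 2) * se
interval
```

The two values in interval are symmetric around m, one below and one above.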
> 1:5
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
seq is the function equivalent of the colon operator.
> seq(from = 1, to = 5)
[1] 1 2 3 4 5
> seq(to = 5, from = 1) # identical result
[1] 1 2 3 4 5
Arguments can be specified by name, using name = value.
Do not use <- in place of = when specifying named arguments.
> seq(1, 5)
[1] 1 2 3 4 5
> seq(from = 1, to = 5)
[1] 1 2 3 4 5
> seq(begin = 1, end = 5)
Warning: In seq.default(begin = 1, end = 5) :
extra arguments 'begin', 'end' will be disregarded
[1] 1
> # Try this:
> ?seq
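Named arguments become more useful when a function has optional arguments. As a small sketch, seq also accepts by and length.out, both described in ?seq; the variable names below are made up for illustration.

```r
# by sets the step size between consecutive values
quarters <- seq(from = 0, to = 1, by = 0.25)
quarters      # 0.00 0.25 0.50 0.75 1.00

# length.out asks for a fixed number of evenly spaced values instead
four_points <- seq(from = 1, to = 10, length.out = 4)
four_points   # 1 4 7 10
```

Naming the arguments makes it clear which of the optional settings is being used.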
> install.packages("name_of_package")
Try this now:
> install.packages("tidyverse")
Use the library function to load an installed package.
Note that library loads a package, not a library.
> library("name_of_package")
> ?filter # returns documentation for a function called filter in the stats package
> library(dplyr)
> ?filter # now returns documentation for a function called filter in the dplyr package!
To refer to a function in a specific package, put the package name and :: before the function name.
> ?stats::filter
> ?dplyr::filter
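The :: notation works for calling functions as well as for looking up help. The sketch below avoids needing any extra package by deliberately defining a function named sd, which masks the sd function from the stats package; the names masked and original are made up for illustration.

```r
# Define our own function called sd; it masks stats::sd in this session
sd <- function(x) "my own sd"

masked   <- sd(c(1, 2, 3))          # calls our function
original <- stats::sd(c(1, 2, 3))   # calls the version in the stats package

masked     # "my own sd"
original   # 1

rm(sd)     # remove our function so that sd refers to stats::sd again
```

This is the same situation as filter above: two functions share a name, and :: says which one is meant.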
Type factorial(1:10) in the editor, then run the current line with Command-RETURN (Mac), or Ctrl-ENTER (Windows).
See the Code menu for other commands.
> # This is a comment
> 1 + 2 # add some numbers
[1] 3
Use # to start a comment.
If you're working in R locally (installed on your computer), you will need to install the tidyverse package. If you're on DataHub it has already been installed.
> install.packages("tidyverse")
> library("tidyverse")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ tibble 3.1.5 ✓ dplyr 1.0.7
✓ tidyr 1.1.4 ✓ stringr 1.4.0
✓ readr 2.0.2 ✓ forcats 0.5.1
✓ purrr 0.3.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Put library("tidyverse") at the top of every script file.
A data frame is one of the most powerful features in R.
> mtc
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
A tibble is a kind of data frame. This one has 32 rows and 11 columns. We only see the first 10 rows because of limited slide/screen space.
The label under each column name, <dbl>, means double-precision floating point number, which is a computer science term for any number with a decimal point in it (e.g. 1.3333, 3.14159, 1.0).
> mtc = read_csv("https://tinyurl.com/mtcars-csv")
read_csv (from the readr package, part of tidyverse) reads in data frames that are stored in .csv files (.csv = comma-separated values).
It can also read a file saved on your computer: read_csv("path/to/file/mtcars.csv")
Type ?read_csv to learn a bit more.
.csvs can also be exported from spreadsheets and databases and then saved locally to be read into R.
Use tibble() to make your own data frames from scratch in R.
> my_data = tibble( # newlines don't do anything, just increase code readability
+ mrn = c(1, 2, 3, 4),
+ age = c(33, 48, 8, 29)
+ )
> my_data
# A tibble: 4 × 2
mrn age
<dbl> <dbl>
1 1 33
2 2 48
3 3 8
4 4 29
dim() gives the dimensions of the data frame. ncol() and nrow() give you the number of columns and the number of rows, respectively.
> dim(my_data)
[1] 4 2
> ncol(my_data)
[1] 2
> nrow(my_data)
[1] 4
names() gives you the names of the columns (a vector).
> names(my_data)
[1] "mrn" "age"
glimpse() shows you a lot of information.
> glimpse(my_data)
Rows: 4
Columns: 2
$ mrn <dbl> 1, 2, 3, 4
$ age <dbl> 33, 48, 8, 29
head() returns the first n rows.
> head(my_data, n=2)
# A tibble: 2 × 2
mrn age
<dbl> <dbl>
1 1 33
2 2 48
The rest of this section shows the basic data frame functions (“verbs”) in the dplyr package (part of tidyverse). Each operation takes a data frame and produces a new data frame.
filter() picks out rows according to specified conditions
select() picks out columns according to their names
arrange() sorts the rows by values in some column(s)
mutate() creates new columns, often based on operations on other columns
These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These functions provide the verbs for a language of data manipulation.
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the column names without quotes.
The result is a new data frame.
Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let’s dive in and see how these verbs work.
> filter(mtc, mpg >= 25)
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
6 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
> filter(mtc, mpg >= 25, qsec < 19)
# A tibble: 4 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
3 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
> filter(mtc, mpg > 60)
# A tibble: 0 × 11
# … with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>, drat <dbl>,
# wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
== tests for equality (do not use =, which is for assignment)
The operators > and < test for greater-than and less-than
The operators >= and <= test for greater-than-or-equal and less-than-or-equal
> c(1,5,-22,4) > 0
[1] TRUE TRUE FALSE TRUE
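A comparison like the one above produces a logical vector, and filter() keeps the rows where that vector is TRUE. As a small sketch outside of any data frame, logical vectors can also be counted and averaged; the variable names here are made up for illustration.

```r
values <- c(1, 5, -22, 4)
is_positive <- values > 0            # TRUE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0 in arithmetic
n_positive    <- sum(is_positive)    # 3: how many values are positive
frac_positive <- mean(is_positive)   # 0.75: the fraction that are positive
```

The same idea tells you how many rows a filter() condition would keep before you run it.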
> filter(mtc, mpg > 30 | mpg < 20)
# A tibble: 22 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
3 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
4 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
5 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
6 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
8 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
# … with 12 more rows
| stands for OR, & is AND
Separating conditions with a comma is the same as using & inside filter()
> filter(mtc, !(mpg > 30 | mpg < 20))
# A tibble: 10 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
! is NOT, which negates the logical condition
> filter(mtc, cyl %in% c(6,8)) # equivalent to filter(mtc, cyl==6 | cyl==8)
# A tibble: 21 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
5 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
6 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
8 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
# … with 11 more rows
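The %in% operator also works on plain vectors, outside of filter(). The sketch below uses a short made-up vector to show that it gives the same answer as chaining == comparisons with |.

```r
cyl <- c(4, 6, 8, 6)                 # illustrative values only

with_in <- cyl %in% c(6, 8)          # FALSE TRUE TRUE TRUE
with_or <- cyl == 6 | cyl == 8       # the same result written with OR

identical(with_in, with_or)          # TRUE
```

%in% is more convenient when the list of allowed values is long.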
%in% returns TRUE for all elements of the thing on the left that are also elements of the thing on the right
Which cars have horsepower (hp) greater than 200?
> filter(mtc, hp > 200)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
2 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
3 10.4 8 460 215 3 5.42 17.8 0 0 3 4
4 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
5 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
7 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Find the cars with mpg between 15 and 20.
> filter(mtc, mpg > 15, mpg < 20)
# A tibble: 12 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
3 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
4 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
5 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
6 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
7 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
9 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
10 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
11 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
12 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
> filter(mtc, row_number()<=3)
# A tibble: 3 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
Use row_number() to get specific rows. This is more useful once you have sorted the data in a particular order, which we will soon see how to do.
> sample_n(mtc, 5)
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 460 215 3 5.42 17.8 0 0 3 4
2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
3 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
Use sample_n() to get n randomly selected rows if you don't have a particular condition you would like to filter on.
sample_frac() is similar
Type ?sample_n to see how you can sample with replacement or with weights
> select(mtc, mpg, qsec, wt)
# A tibble: 32 × 3
mpg qsec wt
<dbl> <dbl> <dbl>
1 21 16.5 2.62
2 21 17.0 2.88
3 22.8 18.6 2.32
4 21.4 19.4 3.22
5 18.7 17.0 3.44
6 18.1 20.2 3.46
7 14.3 15.8 3.57
8 24.4 20 3.19
9 22.8 22.9 3.15
10 19.2 18.3 3.44
# … with 22 more rows
select() can also be used with handy helpers like starts_with() and contains()
> select(mtc, starts_with("m"))
# A tibble: 32 × 1
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# … with 22 more rows
> select(mtc, hp, contains("m"))
# A tibble: 32 × 3
hp mpg am
<dbl> <dbl> <dbl>
1 110 21 1
2 110 21 1
3 93 22.8 1
4 110 21.4 0
5 175 18.7 0
6 105 18.1 0
7 245 14.3 0
8 62 24.4 0
9 95 22.8 0
10 123 19.2 0
# … with 22 more rows
The quotes around the letter "m" make it a character string (or string for short). If we did not do this, R would think it was looking for a variable called m and not just the plain letter.
We don't have to quote the names of columns (like hp) because the tidyverse functions know that we are working within the data frame and thus treat the column names like they are variables in their own right.
select() can also be used to select everything except for certain columns by using the minus character -
> select(mtc, -contains("m"), -hp)
# A tibble: 32 × 8
cyl disp drat wt qsec vs gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6 160 3.9 2.62 16.5 0 4 4
2 6 160 3.9 2.88 17.0 0 4 4
3 4 108 3.85 2.32 18.6 1 4 1
4 6 258 3.08 3.22 19.4 1 3 1
5 8 360 3.15 3.44 17.0 0 3 2
6 6 225 2.76 3.46 20.2 1 3 1
7 8 360 3.21 3.57 15.8 0 3 4
8 4 147. 3.69 3.19 20 1 4 2
9 4 141. 3.92 3.15 22.9 1 4 2
10 6 168. 3.92 3.44 18.3 1 4 4
# … with 22 more rows
select() has a friend called pull() which returns a vector instead of a (one-column) data frame
> select(mtc, hp)
# A tibble: 32 × 1
hp
<dbl>
1 110
2 110
3 93
4 110
5 175
6 105
7 245
8 62
9 95
10 123
# … with 22 more rows
> pull(mtc, hp)
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66 52
[20] 65 97 150 150 245 175 66 91 113 264 175 335 109
> filter(mtc, row_number()==1)
# A tibble: 1 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
> head(mtc)
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
select() and filter() are functions, so they do not modify their input. You can see mtc is unchanged after calling filter() on it. This holds for functions in general.
> mtc_first_row = filter(mtc, row_number()==1)
> mtc_first_row
# A tibble: 1 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
> # tmp = select(mtc, mpg, qsec, wt)
> # filter(tmp, mpg >= 25)
> filter(select(mtc, mpg, qsec, wt), mpg >= 25)
# A tibble: 6 × 3
mpg qsec wt
<dbl> <dbl> <dbl>
1 32.4 19.5 2.2
2 30.4 18.5 1.62
3 33.9 19.9 1.84
4 27.3 18.9 1.94
5 26 16.7 2.14
6 30.4 16.9 1.51
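Nested calls like the one above are read from the inside out. Later examples in this tutorial use the pipe operator %>% (loaded with the tidyverse), which passes the result on its left as the first argument of the function on its right. The sketch below uses a small made-up tibble named cars_small, which is not part of the original data, to show that the two styles give the same result.

```r
library(dplyr)

# A small made-up data frame for illustration (not the mtc data)
cars_small <- tibble(
  mpg  = c(21.0, 32.4, 18.7, 30.4),
  qsec = c(16.5, 19.5, 17.0, 18.5),
  wt   = c(2.62, 2.20, 3.44, 1.62),
  hp   = c(110, 66, 175, 52)
)

# Nested version: read from the inside out
nested <- filter(select(cars_small, mpg, qsec, wt), mpg >= 25)

# Piped version: read from top to bottom
piped <- cars_small %>%
  select(mpg, qsec, wt) %>%
  filter(mpg >= 25)

piped
```

Both versions keep the two rows with mpg of at least 25 and the three selected columns; the piped form tends to be easier to read as more steps are added.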
arrange takes a data frame and a column, and sorts the rows by the values in that column (ascending order).
> powerful <- filter(mtc, hp > 200)
> arrange(powerful, mpg)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
> arrange(powerful, gear, disp)
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
4 10.4 8 460 215 3 5.42 17.8 0 0 3 4
5 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
> arrange(powerful, desc(mpg))
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
2 15 8 301 335 3.54 3.57 14.6 0 1 5 8
3 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
Use arrange() and filter() to get the data for the 5 cars with the highest mpg.
> filter(arrange(mtc, desc(mpg)), row_number()<=5) # "nesting" the calls to filter and arrange
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
or
> cars_by_mpg = arrange(mtc, desc(mpg)) # using a temporary variable
> filter(cars_by_mpg, row_number()<=5)
# A tibble: 5 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
> mtc_vars_subset = select(mtc, mpg, hp)
> mutate(mtc_vars_subset, gpm = 1/mpg)
# A tibble: 32 × 3
mpg hp gpm
<dbl> <dbl> <dbl>
1 21 110 0.0476
2 21 110 0.0476
3 22.8 93 0.0439
4 21.4 110 0.0467
5 18.7 175 0.0535
6 18.1 105 0.0552
7 14.3 245 0.0699
8 24.4 62 0.0410
9 22.8 95 0.0439
10 19.2 123 0.0521
# … with 22 more rows
This uses mutate to add a new column which is the reciprocal of mpg.
The thing on the left of the = is a new name that you make up which you would like the new column to be called.
The expression on the right of the = defines what will go into the new column.
mutate() can create multiple columns at the same time and use multiple columns to define a single new one.
> mutate(mtc_vars_subset, # the newlines make it more readable
+ gpm = 1/mpg,
+ mpg_hp_ratio = mpg/hp)
# A tibble: 32 × 4
mpg hp gpm mpg_hp_ratio
<dbl> <dbl> <dbl> <dbl>
1 21 110 0.0476 0.191
2 21 110 0.0476 0.191
3 22.8 93 0.0439 0.245
4 21.4 110 0.0467 0.195
5 18.7 175 0.0535 0.107
6 18.1 105 0.0552 0.172
7 14.3 245 0.0699 0.0584
8 24.4 62 0.0410 0.394
9 22.8 95 0.0439 0.24
10 19.2 123 0.0521 0.156
# … with 22 more rows
mtc_vars_subset is unchanged after the mutate.
> df = tibble(number = c("1", "2", "3"))
> mutate(df, number_plus_1 = number + 1)
Error: Problem with `mutate()` column `number_plus_1`.
ℹ `number_plus_1 = number + 1`.
x non-numeric argument to binary operator
mutate() is also useful for converting data types, in this case text to numbers
> mutate(df, number = as.numeric(number))
# A tibble: 3 × 1
number
<dbl>
1 1
2 2
3 3
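The conversion above works because every string looks like a number. As a hedged sketch with made-up values, this is what as.numeric does when one of the strings cannot be read as a number:

```r
text_values <- c("1", "2.5", "three")

# "three" cannot be read as a number, so it becomes NA (with a warning);
# suppressWarnings() is used here only to keep the output quiet
numbers <- suppressWarnings(as.numeric(text_values))
numbers                        # 1.0 2.5 NA

has_problem <- is.na(numbers)  # FALSE FALSE TRUE
```

Checking for NA after a conversion is a simple way to find the values that need cleaning.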
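filter() picks out rows according to specified conditions
select() picks out columns according to their names
arrange() sorts the rows by values in some column(s)
mutate() creates new columns, often based on operations on other columns
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the column names without quotes.
The result is a new data frame.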
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
ggplot2 is a very powerful graphics package.
It is part of the tidyverse.
> install.packages("ggplot2")
> library("ggplot2")
> ggplot(data = mtc, mapping = aes(x = hp, y = mpg)) +
+ geom_point()
Note that, although the package is named ggplot2, the function is called simply ggplot().
ggplot(data = mtc, mapping = aes(x = hp, y = mpg)) + geom_point()
data = mtc: this tells which tibble contains the data to be plotted
mapping = aes(x = hp, y = mpg): use the data in the hp column on x-axis, mpg column on y-axis
geom_point(): plot the data as points
The argument names data, mapping, x, and y can be left out:
ggplot(mtc, aes(hp, mpg)) +
  geom_point()
> ggplot(mtc, aes(hp, mpg)) +
+ geom_line()
> ggplot(mtc, aes(hp, mpg)) +
+ geom_point() +
+ geom_smooth(method="lm")
`geom_smooth()` using formula 'y ~ x'
"lm" means “linear model”; here it fits a least-squares regression line.
> ggplot(mtc, aes(hp, mpg)) +
+ geom_point() +
+ geom_smooth(method="loess")
`geom_smooth()` using formula 'y ~ x'
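Plots like these are built by adding layers with +, and the same mechanism can add axis labels and a title. The sketch below is an illustration only: it assumes the built-in mtcars data frame (which ships with R) as a stand-in for mtc, and the label text is made up.

```r
library(ggplot2)

# mtcars ships with R; it is assumed here to stand in for mtc
p <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles per gallon", title = "Fuel economy vs. power")
```

The plot is stored in p; typing p at the console draws it.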
First, let's load in some new data.
> data1 <- read_csv("https://raw.githubusercontent.com/pre-142-training/r4ds-courses/fa8642362bf9aa1a6423988ea6d2816d7cb9c39f/data/data1.csv")
Rows: 5 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): name, gender
dbl (3): age, weight, shoesize
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> data1
# A tibble: 5 × 5
name gender age weight shoesize
<chr> <chr> <dbl> <dbl> <dbl>
1 Emmet F 10 1 5
2 Jordan M 11 3 8
3 Tala F 20 1.5 5
4 Parker M 25 4 10
5 Riley <NA> 66 5 9
<chr> is short for “character string”, which means text data
> data1 %>%
+ ggplot(aes(x = name, shoesize)) +
+ geom_col()
geom_col() is used to make a bar plot. The height of each bar is the value for that individual.
ggplot2 is different, and is based on the idea of a “grammar of graphics,” a set of primitives and rules for combining them in a way that makes sense for plotting data.
Use aes to map from variables (columns in the data frame) to aesthetics (visual properties of the plot): x, y, color, size, shape, and others.
Choose a geom. This determines the type of the plot: point (a scatterplot), line (line graph or line chart), bar (barplot), and others.
Choose a stat (statistical transformation): often identity (do no transformation), but can be used to count, bin, or summarize data (e.g., in a histogram).
Choose a scale. This converts from the units used in the data frame to the units used for display.
Use ggplot to look for a linear relationship between hp and 1/mpg in our mtc data.
> ggplot(mtc, aes(hp, 1/mpg)) +
+ geom_point() +
+ geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula 'y ~ x'
> mtc %>%
+ mutate(gpm = 1/mpg) %>%
+ ggplot(aes(hp, gpm)) +
+ geom_point() +
+ geom_smooth(method="lm", se=FALSE)
> orange <- as_tibble(Orange) # this data is pre-loaded into R
> orange %>%
+ filter(Tree == 2) %>%
+ ggplot(aes(age, circumference)) +
+ geom_point()
Now keep only the measurements with age > 1000.
> orange %>%
+ filter(Tree == 2, age > 1000) %>%
+ ggplot(aes(age, circumference)) +
+ geom_point()
Add a new column called circum_in which is the circumference in inches, not in millimeters.
> mutate(orange, circum_in = circumference/(10 * 2.54))
# A tibble: 35 × 4
Tree age circumference circum_in
<ord> <dbl> <dbl> <dbl>
1 1 118 30 1.18
2 1 484 58 2.28
3 1 664 87 3.43
4 1 1004 115 4.53
5 1 1231 120 4.72
6 1 1372 142 5.59
7 1 1582 145 5.71
8 2 118 33 1.30
9 2 484 69 2.72
10 2 664 111 4.37
# … with 25 more rows
